The previous chapter showed how fitted Q-iteration handles large state spaces through function approximation. FQI maintains a unified structure across batch and online settings: a replay buffer inducing an empirical distribution, a target map derived from the Bellman operator, a loss function, and an optimization budget. Algorithms differ in how they instantiate these components (buffer evolution, hard vs soft Bellman, update frequency), but all follow the same template.
However, this framework breaks down when the action space becomes large or continuous. Computing Bellman targets requires evaluating $\max_{a'} Q(s', a')$ for each next state $s'$. When actions are continuous ($\mathcal{A} \subseteq \mathbb{R}^d$), this maximization requires solving a nonlinear program at every target computation. For a replay buffer with millions of transitions, this becomes computationally prohibitive.
This chapter addresses the continuous action problem while maintaining the FQI framework. We develop several approaches, unified by a common theme: amortization. Rather than solving the optimization problem repeatedly at inference time, we invest computational effort during training to learn a mapping that directly produces good actions. This trades training-time cost for inference-time speed.
The strategies we examine are:
Explicit optimization (Section 2): Solve the maximization numerically for a subset of states, accepting the computational cost for exact solutions.
Policy network amortization (Sections 3-5): Learn a deterministic or stochastic policy network that approximates $\arg\max_a Q(s, a)$ or the optimal stochastic policy, enabling fast action selection via a single forward pass. This includes both hard-max methods (DDPG, TD3) and soft-max methods (SAC, PCL).
Each approach represents a different point in the computation-accuracy trade-off, and all fit within the FQI template by modifying how targets are computed.
Embedded Optimization¶
Recall that in fitted Q methods, the main idea is to compute the Bellman operator only at a subset of all states, relying on function approximation to generalize to the remaining states. At each step of the successive approximation loop, we build a dataset of input state-action pairs mapped to their corresponding optimality operator evaluations:
This dataset is then fed to our function approximator (neural network, random forest, linear model) to obtain the next set of parameters:
While this strategy allows us to handle very large or even infinite (continuous) state spaces, it still requires maximizing over actions during dataset creation when computing the operator at each basepoint. This maximization becomes computationally expensive for large action spaces. We can address this by adding another level of optimization: for each sample added to our regression dataset, we employ numerical optimization methods to find actions that maximize the Bellman operator for the given state.
The above pseudocode introduces a generic optimization routine representing any numerical method that searches for an action maximizing the given function. This approach is versatile and can be adapted to different types of action spaces. For continuous action spaces, we can employ standard nonlinear optimization methods like gradient descent or L-BFGS (e.g., using scipy.optimize.minimize on the negated objective). For large discrete action spaces, we can use integer programming solvers: linear integer programming if the Q-function approximator is linear in actions, or mixed-integer nonlinear programming (MINLP) solvers for nonlinear Q-functions. The choice of solver depends on the structure of our Q-function approximator and the constraints on our action space.
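To make the embedded optimization concrete, here is a minimal sketch assuming a toy differentiable Q-function and a scalar action clipped to a box; the names `q_fn` and `maximize` are illustrative, not from any library:

```python
import numpy as np

def q_fn(state, action):
    # Toy differentiable Q-function: a concave quadratic in the action whose
    # maximizer depends on the state (stands in for a neural critic).
    target = np.sin(state)            # state-dependent optimal action
    return -(action - target) ** 2

def maximize(q, state, a_init=0.0, lr=0.1, steps=200, eps=1e-5):
    # Generic inner routine: projected gradient ascent on a -> q(state, a),
    # using finite-difference gradients, with actions clipped to [-1, 1].
    a = a_init
    for _ in range(steps):
        grad = (q(state, a + eps) - q(state, a - eps)) / (2 * eps)
        a = np.clip(a + lr * grad, -1.0, 1.0)
    return a

state = 0.5
a_star = maximize(q_fn, state)
```

In practice the finite-difference gradient would be replaced by the Q-network's own gradient (via automatic differentiation) and the loop by a library optimizer.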
While explicit optimization provides exact solutions, it becomes computationally expensive when we need to compute targets for millions of transitions in a replay buffer. Can we avoid solving an optimization problem at every decision? The answer is amortization.
Amortized Optimization Approach¶
This process is computationally intensive. We can “amortize” some of this computation by replacing the explicit per-sample optimization with a learned mapping that outputs an approximate maximizer directly. For Q-functions, recall that the operator is given by:
If $q^\star$ is the optimal state-action value function, then acting greedily with respect to it is optimal, and we can derive the optimal policy directly by computing the decision rule:
Since the optimal Q-function is a fixed point of the Bellman optimality operator, we can write:
Note that this inner maximization is carried out by our numerical solver in the procedure above. A practical strategy would be to collect these maximizer values at each step and use them to train a function approximator that directly predicts these solutions. Due to computational constraints, we might want to compute these exact maximizer values only for a subset of states, based on some computational budget, and use the fitted decision rule to generalize to the remaining states. This leads to the following amortized version:
Note that the policy is being trained on a dataset containing optimal actions computed with respect to an evolving Q-function. Specifically, at iteration $n$, we collect pairs of states and actions that are optimal with respect to the current Q-function. However, after the Q-function update, these actions may no longer be optimal with respect to the new Q-function.
A natural approach to handling this staleness is to maintain only the most recent optimization data. We could modify our procedure to keep a sliding window of the last $K$ iterations: at iteration $n$, we only use data from iterations $n - K$ to $n$. This would be implemented by augmenting each entry in the regression dataset with a timestamp:
where the timestamp indicates the iteration at which the optimal action was computed. When fitting the policy network, we then use only data points that are at most $K$ iterations old:
This introduces a trade-off between using more data (larger $K$) and using more recent, accurate data (smaller $K$). The choice of $K$ depends on how quickly the Q-function evolves and on the computational budget available for computing exact optimal actions.
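The timestamp filter can be sketched as follows; the data layout (state, action, iteration) and the function name are illustrative:

```python
# Sliding-window filter for amortization data. Each entry stores
# (state, optimal_action, iteration_computed); when fitting the policy at
# iteration n, we keep only entries at most `window` iterations old.

def filter_recent(dataset, current_iter, window):
    return [(s, a) for (s, a, t) in dataset if current_iter - t <= window]

dataset = [
    ("s0", 0.1, 1),   # computed at iteration 1 (stale)
    ("s1", 0.4, 3),
    ("s2", 0.2, 5),
]
recent = filter_recent(dataset, current_iter=5, window=2)
```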
The main limitation of this approach, beyond the out-of-distribution drift, is that it requires computing exact optimal actions via the solver for states in . Can we reduce or eliminate this computational expense? As the policy improves at selecting actions, we can bootstrap from these increasingly better choices. Continuously amortizing these improving actions over time creates a virtuous cycle of self-improvement toward the optimal policy. However, this bootstrapping process requires careful management: moving too quickly can destabilize training.
Deterministic Parametrized Policies¶
In this section, we consider deterministic parametrized policies of the form $a = \mu_\phi(s)$, which directly output an action given a state. This approach differs from stochastic policies that output probability distributions over actions, making it particularly suitable for continuous control problems where the optimal policy is often deterministic. Fitted Q-value methods can be naturally extended to simultaneously learn both the Q-function and such a deterministic policy.
The Amortization Problem for Continuous Actions¶
When actions are continuous, $\mathcal{A} \subseteq \mathbb{R}^d$, extracting a greedy policy from a Q-function becomes computationally expensive. Consider a robot arm control task where the action is a $d$-dimensional torque vector. To act greedily given a Q-function $Q_\theta$, we must solve:
where $\mathcal{A}$ is a continuous set (often a box or polytope). This requires running an optimization algorithm at every time step. For neural network Q-functions, this means solving a nonlinear program whose objective involves forward passes through the network.
After training converges, the agent must select actions in real-time during deployment. Running interior-point methods or gradient-based optimizers at every decision creates unacceptable latency, especially in high-frequency control where decisions occur at 100Hz or faster.
The solution is to amortize the optimization cost by learning a separate policy network that directly outputs actions. During training, we optimize the policy parameters $\phi$ so that $\mu_\phi(s) \approx \arg\max_a Q_\theta(s, a)$ for states we encounter. At deployment, action selection reduces to a single forward pass through the policy network. The computational cost of optimization is paid during training (where time is less constrained) rather than at inference.
This introduces a second approximation beyond the Q-function. We now have two function approximators: a critic $Q_\theta$ that estimates values, and an actor $\mu_\phi$ that selects actions. The critic is trained using Bellman targets as in standard fitted Q-iteration. The actor is trained to maximize the critic:

$$\max_\phi\; \mathbb{E}_{s}\!\left[Q_\theta\big(s, \mu_\phi(s)\big)\right]$$
where the expectation is over states in the dataset or replay buffer. This gradient ascent pushes the actor toward actions that the critic considers valuable. By the chain rule, the gradient equals $\mathbb{E}_{s}\!\left[\nabla_a Q_\theta(s, a)\big|_{a = \mu_\phi(s)}\, \nabla_\phi \mu_\phi(s)\right]$, which can be efficiently computed via backpropagation through the composition of the two networks.
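A quick numerical sanity check of this chain rule, with a deliberately tiny linear actor and quadratic critic (all shapes and names illustrative):

```python
import numpy as np

# Chain-rule check for the actor gradient: with a linear actor a = phi @ s
# and a critic Q(a) = -(a - 1)^2, the gradient d/dphi Q(mu(s; phi)) equals
# dQ/da * dmu/dphi, which backpropagation computes automatically.

s = np.array([0.5, -0.3])
phi = np.array([0.2, 0.7])

def actor(phi):          # deterministic policy, scalar action
    return phi @ s

def critic(a):           # concave toy critic
    return -(a - 1.0) ** 2

a = actor(phi)
grad_a = -2.0 * (a - 1.0)      # dQ/da, analytic
grad_phi = grad_a * s          # chain rule: dmu/dphi = s for a linear actor

# Finite-difference check of the chain-rule gradient
eps = 1e-6
fd = np.array([
    (critic(actor(phi + eps * e)) - critic(actor(phi - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
```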
Neural Fitted Q-Iteration for Continuous Actions (NFQCA)¶
NFQCA Hafner & Riedmiller, 2011 extends the NFQI template from the previous chapter to handle continuous action spaces by replacing the $\max$ operator in the Bellman target with a parameterized policy $\mu_\phi$. This transforms fitted Q-iteration into an actor-critic method: the critic evaluates state-action pairs via the standard regression step, while the actor provides actions by directly maximizing the learned Q-function.
The algorithm retains the two-level structure of NFQI: an outer loop performs approximate value iteration by computing Bellman targets, and an inner loop fits the Q-function to those targets. NFQCA adds a third component (policy improvement) that updates the policy to maximize the Q-function over states sampled from the dataset.
From Discrete to Continuous Actions¶
Recall from the FQI chapter that NFQI computes Bellman targets using the hard max:

$$y_i = r_i + \gamma \max_{a' \in \mathcal{A}} Q_\theta(s'_i, a')$$
When $\mathcal{A}$ is finite and small, this max is computed by enumeration. When $\mathcal{A}$ is continuous or high-dimensional, enumeration is intractable. NFQCA replaces the max with a parameterized policy that approximately solves the maximization:

$$y_i = r_i + \gamma\, Q_\theta\big(s'_i, \mu_\phi(s'_i)\big)$$
The policy acts as an amortized optimizer: instead of solving the maximization from scratch at each state during target computation, we train a neural network to output near-optimal actions directly. The term “amortized” refers to spreading the cost of optimization across training: we pay once to learn the policy, then reuse it for all future target computations.
To train the policy, we maximize the expected Q-value under the distribution of states in the dataset. If we had access to the optimal Q-function $Q^\star$, we would solve:

$$\max_\phi\; \mathbb{E}_{s \sim \hat{p}}\!\left[Q^\star\big(s, \mu_\phi(s)\big)\right]$$
where $\hat{p}$ is the empirical distribution over states induced by the offline dataset $\mathcal{D}$. In practice, we use the current Q-function approximation after it has been fitted to the latest targets. The expectation is approximated by the sample average over states appearing in the dataset:

$$\max_\phi\; \frac{1}{|\mathcal{D}|} \sum_{s \in \mathcal{D}} Q_\theta\big(s, \mu_\phi(s)\big)$$
This policy improvement step runs after the Q-function has been updated, using the newly-fitted critic to guide the actor toward higher-value actions. Both the Q-function fitting and policy improvement use gradient-based optimization on the respective objectives.
The algorithm structure mirrors NFQI (Algorithm 1 in the FQI chapter) with two extensions. First, target computation (lines 7-8) replaces the discrete max with a policy network call, making the Bellman operator tractable for continuous actions. Second, after fitting the Q-function (line 11), we add a policy improvement step (line 13) that updates the policy to maximize the Q-function evaluated at policy-generated actions over states in the dataset.
Both fit operations use gradient descent with warm starting, consistent with the NFQI template. The Q-function minimizes squared Bellman error using targets computed with the current policy. The policy maximizes the Q-function via gradient ascent on the composition $Q_\theta(s, \mu_\phi(s))$, which is differentiable end-to-end when both networks are differentiable. The gradient with respect to the policy parameters is:

$$\nabla_\phi\, Q_\theta\big(s, \mu_\phi(s)\big) = \nabla_a Q_\theta(s, a)\big|_{a = \mu_\phi(s)}\, \nabla_\phi \mu_\phi(s)$$
computed via the chain rule (backpropagation through the actor into the critic). Modern automatic differentiation libraries handle this composition automatically.
Euler Equation Methods: Approximating Policies Instead of Values¶
The weighted residual framework applies to any functional equation arising from an MDP. So far we have applied it to the Bellman equation to approximate the value function. An alternative is to approximate the optimal policy directly by enforcing first-order optimality conditions Judd, 1992Rust, 1996Judd, 1998.
Consider a control problem with continuous states and actions, deterministic dynamics $s' = f(s, a)$, and reward $r(s, a)$, both continuously differentiable. For each state $s$, the optimal action satisfies the first-order condition

$$\nabla_a r(s, a) + \gamma\, \nabla_a f(s, a)^\top \nabla v\big(f(s, a)\big) = 0.$$
This involves the unknown value gradient $\nabla v$ evaluated at the next state. In a general MDP, one would need to solve jointly for both the policy and the value-function derivatives.
The Euler class¶
Many control problems have additional structure that eliminates the value gradient from the first-order condition. Rust formalizes such problems as an Euler class Rust, 1996: the state can be written as $s = (y, z)$, where $y$ is controlled (inventory, battery charge, water level) and $z$ evolves independently (demand, weather, prices). The transition law factors as
with continuously differentiable controlled dynamics. The Euler-class condition requires a proportionality function such that
In scalar problems, this means the derivative with respect to the controlled state is proportional to the derivative with respect to the action. For dynamics affine in the controlled state and the action (with coefficients that may depend on $z$), the two partial derivatives are state-independent, so the condition holds with a constant ratio. This covers inventory, energy storage, reservoir, and thermal models.
Under this condition, an envelope formula expresses the value gradient purely in terms of the reward, the dynamics, and the policy:
Substituting into the first-order condition eliminates the value gradient entirely, yielding an equation depending only on primitives and the policy. This transforms a coupled system (policy + value) into a closed functional equation in the policy alone.
Discretization¶
For a parameterized policy $\pi_\phi$, we discretize the Euler equation using weighted residuals. With collocation at a set of points, we solve
With Galerkin projection using test functions and a weighting measure, we solve
The mapping is nonlinear in the policy parameters, requiring Newton-type methods or other root-finding schemes. Unlike the Bellman operator, this operator is not a contraction, so convergence guarantees are problem-dependent.
Deep Deterministic Policy Gradient (DDPG)¶
We now extend NFQCA to the online setting with evolving replay buffers, mirroring how DQN extended NFQI in the FQI chapter. Just as DQN allowed the dataset and its induced distribution to evolve during learning instead of using a fixed offline dataset, DDPG Lillicrap et al., 2015 collects new transitions during training and stores them in a circular replay buffer.
Like DQN, DDPG uses the flattened FQI structure with target networks. But where DQN maintains a single target network for the Q-function, DDPG maintains two target networks: one for the critic and one for the actor. Both are updated periodically to mark outer-iteration boundaries, following the same nested-to-flattened transformation shown for DQN.
The online network now plays a triple role in DDPG: (1) the parameters being actively trained ($\theta$ for the critic, $\phi$ for the actor), (2) the policy used to collect new data, and (3) the gradient source for policy improvement. The target networks serve only one purpose: computing stable Bellman targets.
Exploration via Action Noise¶
Since the policy is deterministic, exploration requires adding noise to actions during data collection:

$$a_t = \mu_\phi(s_t) + \varepsilon_t,$$

where $\varepsilon_t$ is exploration noise. The original DDPG paper used an Ornstein-Uhlenbeck (OU) process, which generates temporally correlated noise through the discretized stochastic differential equation:

$$x_{t+1} = x_t + \theta(\mu - x_t)\,\Delta t + \sigma \sqrt{\Delta t}\; \eta_t, \qquad \eta_t \sim \mathcal{N}(0, 1),$$

where $\mu$ is the long-term mean (typically 0), $\theta$ controls the strength of mean reversion, $\sigma$ scales the random fluctuations, and $\Delta t$ is the time step. The drift term $\theta(\mu - x_t)\,\Delta t$ acts like damped motion through a viscous fluid: when $x_t$ deviates from $\mu$, this force pulls it back smoothly without oscillation. The random term adds perturbations, creating noise that wanders but is gently pulled back toward $\mu$. This temporal correlation produces smoother exploration trajectories than independent Gaussian noise.
However, later work (including TD3, discussed below) found that simple uncorrelated Gaussian noise works equally well and is easier to tune. The exploration mechanism is orthogonal to the core algorithmic structure.
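A minimal sketch of the discretized OU recursion, with parameter values chosen for illustration only, confirming that successive samples are strongly correlated:

```python
import numpy as np

# Discretized Ornstein-Uhlenbeck exploration noise. The parameter values
# below are illustrative, loosely following common DDPG defaults.
def ou_trajectory(n_steps, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        drift = theta * (mu - x[t - 1]) * dt               # mean-reverting pull
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()
        x[t] = x[t - 1] + drift + diffusion
    return x

noise = ou_trajectory(5000)

# Temporal correlation: adjacent OU samples are strongly correlated,
# unlike independent Gaussian noise of the same scale.
lag1_corr = np.corrcoef(noise[:-1], noise[1:])[0, 1]
```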
The algorithm structure parallels DQN (Algorithm 5 in the FQI chapter) with the continuous-action extensions from NFQCA. Lines 1-5 initialize both networks and their targets, following the same pattern as DQN but with an additional actor network. Line 3 uses the online actor with exploration noise for data collection, replacing DQN’s $\epsilon$-greedy selection. Line 7 computes targets using both target networks: the actor target selects the next action, the critic target evaluates it. This replaces the $\max$ operator in DQN. Lines 8-9 update both networks: critic via TD error minimization, actor via policy gradient through the updated critic. Line 10 performs periodic hard updates, marking outer-iteration boundaries.
The policy gradient in line 9 uses the chain rule to backpropagate through the actor-critic composition:

$$\nabla_\phi\, \mathbb{E}_{s \sim \mathcal{B}}\!\left[Q_\theta\big(s, \mu_\phi(s)\big)\right] = \mathbb{E}_{s \sim \mathcal{B}}\!\left[\nabla_a Q_\theta(s, a)\big|_{a = \mu_\phi(s)}\, \nabla_\phi \mu_\phi(s)\right]$$
This is identical to the NFQCA gradient, but now computed on mini-batches sampled from an evolving replay buffer rather than a fixed offline dataset. The critic gradient at the policy-generated action provides the direction of steepest ascent in Q-value space, weighted by how sensitive the policy output is to its parameters via the policy Jacobian $\nabla_\phi \mu_\phi(s)$.
Twin Delayed Deep Deterministic Policy Gradient (TD3)¶
DDPG inherits the overestimation bias from DQN’s use of the max operator in Bellman targets. TD3 Fujimoto et al., 2018 addresses this through three modifications to the DDPG template, following similar principles to Double DQN but adapted for continuous actions and taking a more conservative approach.
Twin Q-Networks and the Minimum Operator¶
Recall from the Monte Carlo chapter that overestimation arises when we use the same noisy estimate both to select which action looks best and to evaluate that action. Double Q-learning breaks this coupling by maintaining two independent estimators $Q_1 = Q + \epsilon_1$ and $Q_2 = Q + \epsilon_2$ with independent zero-mean noise terms $\epsilon_1$ and $\epsilon_2$:

$$a^\star = \arg\max_{a} Q_1(s', a), \qquad y = r + \gamma\, Q_2(s', a^\star)$$
When $\epsilon_1$ and $\epsilon_2$ are independent, the tower property of conditional expectation gives $\mathbb{E}[\epsilon_2(s', a^\star)] = 0$, because $a^\star$ (determined by $\epsilon_1$) is independent of $\epsilon_2$. This eliminates evaluation bias: we no longer use the same positive noise that selected an action to also inflate its value. By conditioning on the selected action and then taking expectations over the independent evaluation noise, the bias in the evaluation term vanishes.
Double DQN (Algorithm 6) implements this principle in the discrete action setting by using the online network for selection ($a^\star = \arg\max_a Q_\theta(s', a)$) and the target network for evaluation ($Q_{\theta^-}(s', a^\star)$). Since these networks experience different training noise, their errors are approximately independent, achieving the independence condition needed to eliminate evaluation bias. However, selection bias remains: the argmax still picks actions that received positive noise in the selection network, so some optimistic bias survives.
TD3 takes a more conservative approach. Instead of decoupling selection from evaluation, TD3 maintains twin Q-networks $Q_{\theta_1}$ and $Q_{\theta_2}$ trained on the same data with different random initializations. When computing targets, TD3 uses the target policy to select actions (no maximization over a discrete set), then takes the minimum of the two Q-networks’ evaluations:

$$y = r + \gamma \min_{i = 1, 2} Q_{\theta_i^-}\big(s', \tilde{a}\big)$$

where $\tilde{a}$ is the action proposed by the target policy at $s'$. This minimum operation provides a pessimistic estimate: if the two Q-networks have independent errors, then $\mathbb{E}\big[\min_i Q_{\theta_i}\big] \le \min_i \mathbb{E}\big[Q_{\theta_i}\big]$, producing systematic underestimation rather than overestimation.
The connection to the conditional independence framework is subtle but important. While Double DQN uses independence to eliminate bias in expectation (one network selects, another evaluates), TD3 uses independence to construct a deliberate lower bound. Both approaches rely on maintaining two Q-functions with partially decorrelated errors, achieved through different initializations and stochastic gradients during training, but they aggregate these functions differently. Double DQN’s decoupling targets unbiased estimation by breaking the correlation between selection and evaluation noise. TD3’s minimum operation targets robust estimation by taking the most pessimistic view when the two networks disagree.
This trade-off between bias and robustness is deliberate. In actor-critic methods, the policy gradient pushes toward actions with high Q-values. Overestimation is particularly harmful because it can lead the policy to exploit erroneous high-value regions. Underestimation is generally safer: the policy may ignore some good actions, but it will not be misled into pursuing actions that only appear valuable due to approximation error. The minimum operation implements a “trust the pessimist” principle that complements the policy optimization objective.
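A small Monte Carlo illustration of why the minimum is pessimistic: with two unbiased estimates corrupted by independent noise, the max is biased upward and the min downward (all values illustrative):

```python
import numpy as np

# Clipped double-Q effect in miniature: Q1 = Q + e1 and Q2 = Q + e2 are both
# unbiased, but max(Q1, Q2) overestimates the true value while min(Q1, Q2)
# underestimates it.
rng = np.random.default_rng(0)
q_true = 1.0
n = 200_000
e1 = rng.normal(0.0, 0.5, n)
e2 = rng.normal(0.0, 0.5, n)

mean_max = np.mean(np.maximum(q_true + e1, q_true + e2))   # optimistic
mean_min = np.mean(np.minimum(q_true + e1, q_true + e2))   # pessimistic
```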
TD3 also introduces two additional modifications beyond the clipped double Q-learning. First, target policy smoothing adds clipped noise to the target policy’s actions when computing targets: $\tilde{a} = \mu_{\phi^-}(s') + \operatorname{clip}(\epsilon, -c, c)$ with $\epsilon \sim \mathcal{N}(0, \tilde{\sigma})$. This regularization prevents the policy from exploiting narrow peaks in the Q-function approximation error by averaging over nearby actions. Second, delayed policy updates change the actor update frequency: the actor updates once every $d$ critic updates instead of every step. This reduces per-update error by letting the critics converge before the actor adapts to them.
TD3 also replaces DDPG’s hard target updates with exponential moving average (EMA) updates, following the smooth update scheme from Algorithm 4 in the FQI chapter. Instead of copying the online parameters periodically, EMA smoothly tracks the online network: $\theta^- \leftarrow \tau \theta + (1 - \tau)\, \theta^-$ at every update. For small $\tau$, the target lags behind the online network by roughly $1/\tau$ updates, providing smoother learning dynamics.
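A sketch of the EMA update (the $\tau$ value is illustrative):

```python
# Exponential moving average (Polyak) target update, as used by TD3 and SAC.
# With small tau, the target tracks the online parameters with a lag of
# roughly 1/tau updates.
def ema_update(target, online, tau=0.005):
    return [(1 - tau) * t + tau * o for t, o in zip(target, online)]

target = [0.0, 0.0]
online = [1.0, -1.0]
for _ in range(1000):          # after ~1/tau = 200 updates, most of the gap closes
    target = ema_update(target, online)
```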
The algorithm structure parallels Double DQN but with continuous actions. Lines 8.1-8.2 implement clipped double Q-learning: smoothing adds noise to target actions (preventing exploitation of Q-function artifacts), and the min operation (highlighted in blue) provides pessimistic value estimates. Both critics update toward the same shared target (lines 10-11), but their different initializations and stochastic gradient noise keep their errors partially decorrelated, following the same principle underlying Double DQN’s independence assumption. Line 13 gates policy updates to every $d$ steps (typically $d = 2$), and lines 13.2-13.4 use EMA updates following Algorithm 4.
TD3 simplifies exploration by replacing DDPG’s Ornstein-Uhlenbeck process with uncorrelated Gaussian noise (line 5.3). This eliminates the need to tune multiple OU parameters while providing equally effective exploration.
Soft Actor-Critic¶
DDPG and TD3 address continuous actions by learning a deterministic policy that amortizes the $\arg\max$ operation. But deterministic policies have a fundamental limitation: they require external exploration noise (Gaussian perturbations in TD3) and can converge to suboptimal deterministic behaviors without adequate coverage of the state-action space.
The smoothing chapter presents an alternative: entropy-regularized MDPs, where the agent maximizes expected return plus a bonus for policy randomness. This yields stochastic policies with exploration built into the objective itself. The smooth Bellman operator replaces the hard max with a soft-max:

$$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s'}\!\left[\frac{1}{\beta} \log \sum_{a' \in \mathcal{A}} \exp\big(\beta\, Q(s', a')\big)\right]$$
where $\beta$ is the inverse temperature and $\alpha = 1/\beta$ is the entropy regularization weight. For finite action spaces, this log-sum-exp is easy to compute. But for continuous actions $\mathcal{A} \subseteq \mathbb{R}^d$, the sum becomes an integral:

$$\frac{1}{\beta} \log \int_{\mathcal{A}} \exp\big(\beta\, Q(s', a')\big)\, da'$$
This integral is intractable. We face an infinite-dimensional sum over the continuous action space. The very smoothness that gives us stochastic policies creates a new computational barrier, distinct from but analogous to the $\max_{a'}$ problem in standard FQI.
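For the finite-action case, a quick numerical check of the log-sum-exp shows that it upper-bounds the hard max and approaches it as the inverse temperature grows (values illustrative):

```python
import numpy as np

def soft_max_value(q_values, beta):
    # (1/beta) * log sum_a exp(beta * q(a)), computed stably by
    # factoring out the largest exponent.
    z = beta * np.asarray(q_values)
    m = z.max()
    return (m + np.log(np.exp(z - m).sum())) / beta

q = [0.5, 1.0, -0.2]
v_soft_low = soft_max_value(q, beta=1.0)     # strictly above the hard max
v_soft_high = soft_max_value(q, beta=100.0)  # approaches the hard max
```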
From Intractable Integral to Tractable Expectation¶
Soft actor-critic (SAC) Haarnoja et al., 2018; Haarnoja et al., 2018 exploits an equivalence between the intractable integral and an expectation. The optimal policy under entropy regularization is the Boltzmann distribution:

$$\pi^\star(a \mid s) = \frac{\exp\big(\beta\, Q(s, a)\big)}{\int_{\mathcal{A}} \exp\big(\beta\, Q(s, a')\big)\, da'}$$
Under this policy, the soft value function becomes:

$$V(s) = \mathbb{E}_{a \sim \pi^\star(\cdot \mid s)}\!\left[Q(s, a) - \frac{1}{\beta} \log \pi^\star(a \mid s)\right]$$
We have converted an intractable integral into an expectation that we can estimate by sampling. The catch: we need samples from $\pi^\star$, which depends on the Q-function we are trying to learn.
SAC uses the same policy amortization strategy as DDPG: learn a parametric policy that approximates the optimal stochastic policy (the Boltzmann distribution). The policy enables fast action selection through a single forward pass rather than solving an optimization problem. Exploration comes from the policy’s stochasticity rather than from external noise.
Bootstrap Targets via Single-Sample Estimation¶
With a learned policy $\pi_\phi$, we can compute Q-function bootstrap targets. For a transition $(s, a, r, s')$, we need the soft value at the next state $s'$:

$$V(s') = \mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s')}\!\left[Q(s', a') - \alpha \log \pi_\phi(a' \mid s')\right]$$
SAC estimates this with a single Monte Carlo sample: draw $a' \sim \pi_\phi(\cdot \mid s')$ and approximate:

$$\hat{V}(s') = \min_{i = 1, 2} Q_{\theta_i}(s', a') - \alpha \log \pi_\phi(a' \mid s')$$
The minimum over twin Q-networks applies the clipped double-Q trick from TD3. This single-sample approach is computationally efficient: each target requires just one policy sample and two Q-network evaluations.
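The estimator can be checked in a setting where the exact soft value is available in closed form. The sketch below assumes a scalar quadratic Q-function, for which the Boltzmann policy is Gaussian; notably, under the exact Boltzmann policy the single-sample estimate has zero variance, a pointwise exactness that the PCL section below exploits:

```python
import numpy as np

# Single-sample soft value estimation, SAC-style, checked against the exact
# soft value. Assumptions: Q(a) = -a^2/2 and entropy weight alpha, so the
# Boltzmann policy is N(0, alpha) and the exact soft value is
# (alpha/2) * log(2*pi*alpha).
rng = np.random.default_rng(0)
alpha = 0.5

def q_fn(a):
    return -0.5 * a ** 2

def log_pi(a):
    # log density of N(0, alpha)
    return -0.5 * a ** 2 / alpha - 0.5 * np.log(2 * np.pi * alpha)

a = rng.normal(0.0, np.sqrt(alpha), size=100_000)     # samples from pi
estimates = q_fn(a) - alpha * log_pi(a)               # one estimate per sample
v_estimate = estimates.mean()
v_exact = 0.5 * alpha * np.log(2 * np.pi * alpha)
```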
Historical Note: The V-Network in Original SAC
The original SAC paper Haarnoja et al., 2018 introduced a separate value network $V_\psi$ trained to predict the entropy-adjusted expectation, amortizing the soft value computation into a single forward pass. Bootstrap targets then used $y = r + \gamma\, V_\psi(s')$.
However, the follow-up paper Haarnoja et al., 2018 showed this V-network is redundant: the single-sample estimate works just as well while simplifying the architecture. All modern implementations — OpenAI Spinning Up, Stable Baselines3, CleanRL — omit the V-network. We present this simplified version throughout.
Learning the Policy: Matching the Boltzmann Distribution¶
The Q-network update assumes a policy that approximates the Boltzmann distribution $\pi^\star(a \mid s) \propto \exp(\beta\, Q_\theta(s, a))$. Training such a policy presents a problem: the Boltzmann distribution requires the partition function $Z(s) = \int_{\mathcal{A}} \exp(\beta\, Q_\theta(s, a))\, da$, the very integral we are trying to avoid. SAC sidesteps this by minimizing the KL divergence from the policy to the (unnormalized) Boltzmann distribution:

$$\min_\phi\; \mathbb{E}_{s}\!\left[\mathrm{KL}\!\left(\pi_\phi(\cdot \mid s)\,\Big\|\, \frac{\exp\big(\beta\, Q_\theta(s, \cdot)\big)}{Z(s)}\right)\right]$$
Since $Z(s)$ does not depend on $\phi$, this reduces to:

$$\min_\phi\; \mathbb{E}_{s}\, \mathbb{E}_{a \sim \pi_\phi(\cdot \mid s)}\!\left[\alpha \log \pi_\phi(a \mid s) - Q_\theta(s, a)\right]$$
This pushes probability toward high Q-value actions while the $\log \pi_\phi$ term penalizes concentrating probability mass, maintaining entropy. The entropy bonus comes from the KL divergence structure rather than from an explicit regularization term.
To estimate gradients of this objective, we face a technical problem: the policy parameters appear in the sampling distribution $\pi_\phi$, making the gradient difficult to compute. SAC uses a Gaussian policy with the reparameterization trick. Express samples as a deterministic function of parameters and independent noise:

$$a = f_\phi(s, \varepsilon) = \tanh\big(m_\phi(s) + \sigma_\phi(s) \odot \varepsilon\big), \qquad \varepsilon \sim \mathcal{N}(0, I)$$
This moves $\phi$ out of the sampling distribution and into the integrand:

$$\min_\phi\; \mathbb{E}_{s}\, \mathbb{E}_{\varepsilon \sim \mathcal{N}(0, I)}\!\left[\alpha \log \pi_\phi\big(f_\phi(s, \varepsilon) \mid s\big) - Q_\theta\big(s, f_\phi(s, \varepsilon)\big)\right]$$
We can now differentiate through the sampling function and the Q-network, just as DDPG differentiates through a deterministic policy. SAC extends this by sampling fresh noise at each gradient step rather than outputting a single deterministic action.
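A small numerical check of the reparameterization trick in one dimension, with an illustrative quadratic objective standing in for the Q-function:

```python
import numpy as np

# Reparameterization-trick gradient: for a Gaussian policy a = mu + sigma*eps
# with eps ~ N(0, 1), the gradient of E[f(a)] with respect to mu equals
# E[f'(mu + sigma*eps)], estimated by sampling eps. Here f(a) = -(a - 2)^2,
# so the exact gradient with respect to mu is -2*(mu - 2).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.3
eps = rng.standard_normal(200_000)
a = mu + sigma * eps                        # deterministic in mu given eps
grad_estimate = np.mean(-2.0 * (a - 2.0))   # pathwise derivative df/da, averaged
grad_exact = -2.0 * (mu - 2.0)
```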
The algorithm interleaves three updates. The Q-networks (lines 7-10) follow fitted Q-iteration with the soft Bellman target: sample a next action from the current policy, compute the entropy-adjusted target, and minimize squared error. The minimum over twin Q-networks mitigates overestimation as in TD3. The policy (lines 12-13) updates to match the Boltzmann distribution induced by the current Q-function, using the reparameterization trick for gradient estimation. Target networks update via EMA (lines 15-16) to stabilize training.
The stochastic policy serves the same amortization purpose as in DDPG and TD3: it replaces an intractable operation, here the soft-max integral, with a fast network forward pass. SAC’s entropy regularization produces exploration through the policy’s inherent stochasticity rather than external noise. This makes SAC more robust to hyperparameters and eliminates the need to tune exploration schedules.
Path Consistency Learning (PCL)¶
DDPG, TD3, and SAC all follow the same solution template from fitted Q-iteration: compute Bellman targets using the current Q-function, fit the Q-function to those targets, repeat. This is successive approximation, the function iteration approach from the projection methods chapter.
Path Consistency Learning (PCL) Nachum et al., 2017 solves the Bellman equation differently. Instead of iterating the operator, it directly minimizes a residual. This is the least-squares approach from projection methods: solve the fixed-point equation by minimizing the squared residual. The method exploits special structure (smooth Bellman operators under deterministic dynamics) that conventional methods cannot leverage.
The Path Consistency Property¶
Consider the entropy-regularized Q-function Bellman equation from the smoothing chapter. Under general stochastic dynamics, it involves an expectation over next states:

$$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V(s')\big]$$
Suppose the dynamics are deterministic: $s' = f(s, a)$. The next state is uniquely determined, so the expectation disappears:

$$Q(s, a) = r(s, a) + \gamma\, V\big(f(s, a)\big)$$
The value function relates to Q-functions through the soft-max:

$$V(s) = \frac{1}{\beta} \log \int_{\mathcal{A}} \exp\big(\beta\, Q(s, a)\big)\, da$$
To understand what makes PCL work, we need to contrast two cases: general policies versus the optimal Boltzmann policy.
For general policies, the value equals an expectation:

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^\pi(s, a) - \frac{1}{\beta} \log \pi(a \mid s)\right]$$
This is an average. For a single observed action $a$, we have:

$$Q^\pi(s, a) - \frac{1}{\beta} \log \pi(a \mid s) = V^\pi(s) + \eta(s, a)$$

where $\eta(s, a)$ is sampling error with $\mathbb{E}_{a \sim \pi}[\eta(s, a)] = 0$. Individual actions give noisy estimates that fluctuate around the mean.
For the optimal policy under entropy regularization, the Boltzmann structure produces an exact pointwise identity. The optimal policy is:

$$\pi^\star(a \mid s) = \exp\Big(\beta \big(Q^\star(s, a) - V^\star(s)\big)\Big)$$
Taking logarithms and rearranging:

$$V^\star(s) = Q^\star(s, a) - \frac{1}{\beta} \log \pi^\star(a \mid s)$$
This holds exactly for every action $a$, not just in expectation. There is no sampling error. The advantage is precisely encoded in the log-probability: suboptimal actions have low $Q^\star(s, a)$ but also large $-\frac{1}{\beta} \log \pi^\star(a \mid s)$ (low probability means large negative log-probability), and these terms balance exactly to give $V^\star(s)$.
Now take a trajectory segment $(s_0, a_0, s_1, a_1, \dots, s_K)$ where each transition follows the deterministic dynamics $s_{t+1} = f(s_t, a_t)$. Start with $V^\star(s_0)$ and use equation (42) together with the deterministic Bellman equation to substitute exactly:

$$V^\star(s_0) = r(s_0, a_0) - \frac{1}{\beta} \log \pi^\star(a_0 \mid s_0) + \gamma\, V^\star(s_1)$$
Substitute the same identity for $V^\star(s_1)$:
Continue this telescoping for $K$ steps. Each substitution is exact:

$$V^\star(s_0) = \sum_{t=0}^{K-1} \gamma^t \left[r(s_t, a_t) - \frac{1}{\beta} \log \pi^\star(a_t \mid s_t)\right] + \gamma^K V^\star(s_K)$$
Apply equation (42) once more to get :
Rearranging gives the path consistency residual:

$$C(s_{0:K}, a_{0:K-1}) = -V^\star(s_0) + \gamma^K V^\star(s_K) + \sum_{t=0}^{K-1} \gamma^t \left[r(s_t, a_t) - \frac{1}{\beta} \log \pi^\star(a_t \mid s_t)\right]$$
This is exact, not approximate. The telescoping produces an identity: the residual is zero for every action sequence, not just in expectation. This is what enables off-policy learning without importance sampling. The behavior policy never appears because the constraint holds as a deterministic identity for any observed action sequence.
Remark 1 (Contrasting General Policies and Optimal Boltzmann Policies)
The distinction between equations (39) and (42) is subtle but crucial.
For general policies (equation (39)), the value is an average over actions sampled from the policy. Individual actions give noisy estimates: if we draw $a \sim \pi(\cdot \mid s)$, then $Q^\pi(s, a) - \frac{1}{\beta} \log \pi(a \mid s) = V^\pi(s) + \eta$, where $\eta$ is a zero-mean random variable. We need to average many samples to estimate $V^\pi(s)$ accurately. Multi-step telescoping would accumulate these sampling errors, producing noisy residuals even at the true solution. Off-policy learning would require importance weights to correct for using actions from a different behavior policy.
For the optimal entropy-regularized policy (equation (42)), the Boltzmann structure collapses the expectation to a pointwise identity. The relationship $V^\star(s) = Q^\star(s, a) - \frac{1}{\beta} \log \pi^\star(a \mid s)$ holds exactly for every action $a$, optimal or not. A suboptimal action has low $Q^\star(s, a)$ (low expected return) and low $\pi^\star(a \mid s)$ (low probability), making $-\frac{1}{\beta} \log \pi^\star(a \mid s)$ large. These terms balance precisely to give $V^\star(s)$. No sampling error exists. The telescoping is exact, producing a residual that equals zero for every action sequence, not just in expectation. Off-policy learning works because the constraint holds as a deterministic identity for any observed path.
This property is unique to soft-max operators. For hard-max, $V^\star(s) = Q^\star(s, a)$ holds only when $a$ is optimal. Suboptimal actions satisfy $Q^\star(s, a) < V^\star(s)$, an inequality that cannot be used to construct a residual.
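The pointwise identity can be verified numerically. The sketch below builds a small random MDP with deterministic transitions, solves the soft Bellman equation by value iteration, and checks that the one-step path consistency residual vanishes for every state-action pair (all sizes and constants illustrative):

```python
import numpy as np

# Path-consistency check on a small deterministic MDP with entropy
# regularization (temperature tau = 1/beta). We verify that
#   V(s) - gamma * V(f(s, a)) - r(s, a) + tau * log pi(a | s)
# vanishes for EVERY action, not just in expectation.
rng = np.random.default_rng(0)
nS, nA, gamma, tau = 4, 3, 0.9, 0.5
r = rng.normal(size=(nS, nA))               # rewards
f = rng.integers(0, nS, size=(nS, nA))      # deterministic transitions

V = np.zeros(nS)
for _ in range(2000):                       # soft value iteration to convergence
    Q = r + gamma * V[f]                    # Q(s, a) = r(s, a) + gamma * V(s')
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))

Q = r + gamma * V[f]
log_pi = (Q - V[:, None]) / tau             # optimal Boltzmann policy

# Residual over ALL state-action pairs: any behavior policy's actions qualify.
residual = V[:, None] - gamma * V[f] - r + tau * log_pi
max_abs_residual = np.abs(residual).max()
```

The normalization of `log_pi` (each row of `exp(log_pi)` sums to one) holds only at the soft fixed point, which is why the value iteration loop matters.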
Why Deterministic Dynamics and Entropy Regularization Are Both Required¶
PCL’s two structural requirements—deterministic dynamics and entropy regularization—are not arbitrary design choices. Each addresses a fundamental theoretical issue.
Deterministic Dynamics: Avoiding the Double Sampling Problem¶
Under stochastic dynamics, the Q-function Bellman equation has an expectation over next states:

$$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V(s')\big]$$
The exact relationship (42) still holds, so we can write the path consistency constraint. But now consider what PCL minimizes: the squared residual $\delta^2$, where the single-sample residual along an observed transition is

$$\delta = -V(s_t) + r(s_t, a_t) - \tau \log \pi(a_t \mid s_t) + \gamma V(s_{t+1}), \qquad s_{t+1} \sim P(\cdot \mid s_t, a_t).$$
At the true optimum $(V^*, \pi^*)$, the constraint is $\mathbb{E}_{s_{t+1}}[\delta] = 0$, which implies $(\mathbb{E}[\delta])^2 = 0$. But PCL minimizes $\mathbb{E}[\delta^2]$, and by Jensen's inequality:

$$\mathbb{E}[\delta^2] \geq (\mathbb{E}[\delta])^2,$$
with equality only when $\delta$ has zero variance. Under stochastic dynamics, even at optimality, individual trajectory residuals are random variables with mean zero but positive variance (due to transition noise). Minimizing $\mathbb{E}[\delta^2]$ to zero would require driving $\mathrm{Var}[\delta] \to 0$, which is impossible and pushes the solution away from the true optimum.
This is Baird's double sampling problem (Baird, 1995). To get an unbiased gradient of $\tfrac{1}{2}(\mathbb{E}[\delta])^2$, we need:

$$\nabla\, \tfrac{1}{2}(\mathbb{E}[\delta])^2 = \mathbb{E}[\delta]\, \nabla\, \mathbb{E}[\delta] = \mathbb{E}[\delta]\, \mathbb{E}[\nabla \delta].$$
This requires two independent samples of the next state from the same $(s, a)$ pair: one for estimating $\mathbb{E}[\delta]$ and one for $\mathbb{E}[\nabla \delta]$. With a simulator, this is possible. With real trajectories, it is not.
Under deterministic dynamics, $s_{t+1} = f(s_t, a_t)$ is deterministic (no transition noise), so $\mathrm{Var}[\delta] = 0$ and Jensen's inequality holds with equality. Minimizing the squared residual is equivalent to solving $\delta = 0$.
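The gap between $\mathbb{E}[\delta^2]$ and $(\mathbb{E}[\delta])^2$ can be seen in a one-line Monte Carlo experiment. This sketch (illustrative toy values, not from the text) places the values at the true optimum, where the expected residual is zero, and shows that the squared-residual objective still sits at the transition variance and cannot be driven to zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup at the TRUE solution: from (s, a) the reward is 0 and the next
# value V(s') is a zero-mean random variable (pure transition noise), so
# E[delta] = 0 exactly. We take gamma = 1 for simplicity.
gamma = 1.0
n = 100_000
V_next = rng.normal(loc=0.0, scale=1.0, size=n)  # samples of V(s')
delta = 0.0 + gamma * V_next                     # residual samples, mean 0

mean_sq = delta.mean() ** 2     # (E[delta])^2: ~0, the correct objective
sq_mean = (delta ** 2).mean()   # E[delta^2]:   ~Var[delta] = 1, biased floor

print(mean_sq)  # close to 0
print(sq_mean)  # close to 1: no choice of V can remove this term
```

Minimizing `sq_mean` instead of `mean_sq` therefore penalizes transition variance itself, which is exactly the bias that restricting PCL to deterministic dynamics removes.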
Entropy Regularization: Enabling All-Action Consistency¶
To see why entropy regularization is necessary, attempt the same path consistency derivation with the hard-max Bellman operator. Under deterministic dynamics, the Q-function satisfies:

$$Q^*(s, a) = r(s, a) + \gamma V^*(f(s, a)),$$

where $V^*(s) = \max_a Q^*(s, a)$ and the optimal policy is $\mu^*(s) = \arg\max_a Q^*(s, a)$ (deterministic).
Now try to relate $V^*(s)$ to an arbitrary observed action $a$. For the optimal action $a^* = \mu^*(s)$, we have:

$$V^*(s) = Q^*(s, a^*) = r(s, a^*) + \gamma V^*(f(s, a^*)).$$

But for a suboptimal action $a \neq a^*$:

$$V^*(s) > Q^*(s, a) = r(s, a) + \gamma V^*(f(s, a)).$$

This is an inequality, not an equation. There is no formula expressing $V^*(s)$ in terms of $r(s, a)$ and $V^*(f(s, a))$ for suboptimal actions.
Attempt the multi-step telescoping. Start with $V^*(s_t) = r(s_t, a_t^*) + \gamma V^*(s_{t+1})$. To continue, we need to express $V^*(s_{t+1})$ using the observed action $a_{t+1}$. But we only have:

$$V^*(s_{t+1}) \geq r(s_{t+1}, a_{t+1}) + \gamma V^*(s_{t+2}),$$

with equality only if $a_{t+1}$ happens to be optimal at $s_{t+1}$. We cannot substitute this into the Q-function equation to get an exact telescoping. The derivation breaks at the first step.
Compare this to the soft-max case. The Boltzmann structure gives equation (42): $V^*(s) = Q^*(s, a) - \tau \log \pi^*(a \mid s)$ for all actions $a$. The log-probability term compensates exactly for suboptimality: low-probability actions have large $-\tau \log \pi^*(a \mid s)$, which adds to the low $Q^*(s, a)$ to recover $V^*(s)$. This enables exact substitution at every step:

$$V^*(s_{t+i}) = r(s_{t+i}, a_{t+i}) - \tau \log \pi^*(a_{t+i} \mid s_{t+i}) + \gamma V^*(s_{t+i+1}).$$
The telescoping proceeds without inequalities or restrictions on which actions were chosen. This is why multi-step hard-max Q-learning lacks theoretical justification for off-policy data. When we observe a trajectory with suboptimal actions, we cannot write an exact path consistency constraint.
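The contrast can be checked end to end on a toy problem. The following sketch (a hypothetical 3-state deterministic MDP with made-up rewards) solves both the soft and hard Bellman equations by value iteration, then evaluates the multi-step path residual along a deliberately non-greedy action sequence:

```python
import numpy as np

# Tiny deterministic MDP: 3 states, 2 actions; successor f[s, a], reward r[s, a].
f = np.array([[1, 2], [2, 0], [0, 1]])
r = np.array([[1.0, 0.0], [0.5, -0.2], [0.0, 0.3]])
gamma, tau = 0.9, 0.4

# Soft and hard value iteration to numerical convergence.
V, Vh = np.zeros(3), np.zeros(3)
for _ in range(2000):
    V = tau * np.log(np.exp((r + gamma * V[f]) / tau).sum(axis=1))
    Vh = (r + gamma * Vh[f]).max(axis=1)
logpi = (r + gamma * V[f] - V[:, None]) / tau  # Boltzmann-optimal log-policy

def path_residuals(s0, actions):
    """Soft and hard d-step path residuals along an observed action sequence."""
    cs = ch = 0.0
    s, disc = s0, 1.0
    for a in actions:
        cs += disc * (r[s, a] - tau * logpi[s, a])  # soft: reward + log-prob term
        ch += disc * r[s, a]                        # hard: reward only
        s, disc = f[s, a], disc * gamma
    return cs + disc * V[s] - V[s0], ch + disc * Vh[s] - Vh[s0]

soft_c, hard_c = path_residuals(0, [1, 0, 1, 1])  # deliberately non-greedy path
print(abs(soft_c) < 1e-8)  # True: identity holds for ANY action sequence
print(hard_c < 0)          # True: hard-max gives a strict inequality here
```

The soft residual vanishes identically because each term $r - \tau \log \pi^*$ telescopes to $V^*(s_i) - \gamma V^*(s_{i+1})$; the hard residual is strictly negative whenever the path contains a suboptimal action, which is precisely the broken step in the hard-max derivation.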
Both requirements are structural:
| Requirement | Addresses |
|---|---|
| Deterministic dynamics | Double sampling bias: ensures $\mathbb{E}[\delta^2] = (\mathbb{E}[\delta])^2$ |
| Entropy regularization | All-action consistency (equation (42)) |
Without deterministic dynamics, residual minimization is biased. Without entropy regularization, the constraint holds only for optimal actions.
The Learning Objective¶
Equation (47) provides a constraint that the optimal $(V^*, \pi^*)$ must satisfy: the residual equals zero for every observed path. For parametric approximations $(V_\phi, \pi_\theta)$ that are not yet optimal, the residual is nonzero:

$$C(s_{t:t+d}; \theta, \phi) = -V_\phi(s_t) + \gamma^d V_\phi(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i \big[r(s_{t+i}, a_{t+i}) - \tau \log \pi_\theta(a_{t+i} \mid s_{t+i})\big].$$
PCL minimizes the squared residual over observed path segments:

$$\mathcal{L}(\theta, \phi) = \sum_{s_{t:t+d}} \tfrac{1}{2}\, C(s_{t:t+d}; \theta, \phi)^2.$$
This is the least-squares residual approach from the projection methods chapter. SAC computes targets and fits to them (successive approximation). PCL directly minimizes the residual without computing targets or performing a separate fitting step.
Gradient descent gives:

$$\theta \leftarrow \theta - \eta_\pi\, C\, \nabla_\theta C, \qquad \phi \leftarrow \phi - \eta_v\, C\, \nabla_\phi C,$$

where $C = C(s_{t:t+d}; \theta, \phi)$ is the path residual. Large residuals drive larger updates.
The algorithm collects trajectories from the current policy and stores them in a replay buffer. At each iteration, it samples a trajectory (possibly old) and performs gradient descent on the path residual for all $d$-step segments. The replay buffer enables off-policy learning: trajectories from old policies, expert demonstrations, or exploratory behavior all provide valid training signals.
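The update can be sketched in tabular form. This is a hedged illustration, not the paper's implementation: a hypothetical 2-state deterministic MDP, a value table for $\phi$ and policy logits for $\theta$, trained on random (hence off-policy) action sequences. Note that only $V(s_t)$, $V(s_{t+d})$, and the path log-probabilities appear in the residual, so the gradient touches exactly those entries:

```python
import numpy as np

f = np.array([[1, 0], [0, 1]])          # deterministic successor f[s, a]
r = np.array([[1.0, 0.0], [0.0, 0.5]])  # reward r[s, a]
gamma, tau, d, eta = 0.9, 0.5, 3, 0.05

V = np.zeros(2)             # value table (the "phi" parameters)
logits = np.zeros((2, 2))   # policy logits (the "theta" parameters)
rng = np.random.default_rng(1)
abs_resid = []

for _ in range(5000):
    s0 = rng.integers(2)
    actions = rng.integers(2, size=d)    # off-policy: arbitrary observed actions
    # Compute the d-step path residual C and the log-policy gradient along it.
    c, s, disc = -V[s0], s0, 1.0
    g = np.zeros_like(logits)            # grad of sum_i gamma^i log pi(a_i|s_i)
    for a in actions:
        p = np.exp(logits[s] - logits[s].max())
        p /= p.sum()
        c += disc * (r[s, a] - tau * np.log(p[a]))
        onehot = np.zeros(2)
        onehot[a] = 1.0
        g[s] += disc * (onehot - p)      # softmax log-prob gradient
        s, disc = f[s, a], disc * gamma
    c += disc * V[s]                     # disc is now gamma^d, s is s_{t+d}
    abs_resid.append(abs(c))
    # Descent on 0.5*c^2: C is linear in V(s0), V(s_{t+d}), and the log-probs.
    V[s0] += eta * c
    V[s] -= eta * c * disc
    logits += eta * c * tau * g

print(np.mean(abs_resid[:200]), np.mean(abs_resid[-200:]))
```

Since every path residual can be driven to zero simultaneously at the soft-optimal solution, the late-training residual magnitudes fall well below the early ones on this toy problem.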
Unified Parameterization: Single Q-Network¶
Algorithm 7 uses separate networks for policy and value. But we can use a single Q-network and derive both:

$$V_\theta(s) = \tau \log \sum_a \exp\!\big(Q_\theta(s, a)/\tau\big), \qquad \pi_\theta(a \mid s) = \exp\!\big((Q_\theta(s, a) - V_\theta(s))/\tau\big).$$
The path residual becomes:

$$C(s_{t:t+d}; \theta) = -V_\theta(s_t) + \gamma^d V_\theta(s_{t+d}) + \sum_{i=0}^{d-1} \gamma^i \big[r(s_{t+i}, a_{t+i}) - \tau \log \pi_\theta(a_{t+i} \mid s_{t+i})\big],$$
and the gradient combines both value and policy contributions through the same parameters. This unified architecture eliminates the actor-critic separation: one Q-network serves both roles.
Connection to Existing Methods¶
Single-step case ($d = 1$): The path residual becomes $C = -V(s_t) + r(s_t, a_t) - \tau \log \pi(a_t \mid s_t) + \gamma V(s_{t+1})$. For unified parameterization where $\tau \log \pi_\theta(a \mid s) = Q_\theta(s, a) - V_\theta(s)$ exactly, this becomes $C = r(s_t, a_t) + \gamma V_\theta(s_{t+1}) - Q_\theta(s_t, a_t)$, the soft Bellman residual. Minimizing $C^2$ is equivalent to soft Q-learning, though SAC solves this via successive approximation (compute targets, fit) rather than direct residual minimization.
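The single-step reduction is a pure algebraic cancellation, which a short sketch makes concrete (illustrative Q-values and a made-up transition, not from the text). Deriving $V$ and $\pi$ from one Q-table and forming the $d = 1$ path residual recovers the soft Bellman residual exactly:

```python
import numpy as np

tau, gamma = 0.5, 0.9
Q = np.array([[1.0, -0.2], [0.3, 0.8]])  # Q_theta(s, a): two states, two actions

# Unified parameterization: both V and log pi come from the single Q-table.
V = tau * np.log(np.exp(Q / tau).sum(axis=1))
logpi = (Q - V[:, None]) / tau           # so tau*log pi = Q - V exactly

# One observed transition (s, a, r, s') with made-up values.
s, a, rwd, s_next = 0, 1, 0.25, 1
path_resid = -V[s] + rwd - tau * logpi[s, a] + gamma * V[s_next]
soft_bellman = rwd + gamma * V[s_next] - Q[s, a]

print(np.isclose(path_resid, soft_bellman))        # True: same residual
print(np.allclose(np.exp(logpi).sum(axis=1), 1.0)) # pi is a valid distribution
```

The $-V(s_t)$ and $+V(s_t)$ terms cancel through $\tau \log \pi = Q - V$, leaving only the soft Bellman error in $Q_\theta$, which is why the unified single-step case coincides with soft Q-learning.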
No entropy ($\tau \to 0$): The residual becomes $C = -V(s_t) + \sum_{i=0}^{d-1} \gamma^i r(s_{t+i}, a_{t+i}) + \gamma^d V(s_{t+d})$, the negative $d$-step advantage. But unlike A2C/A3C where $V_\phi$ tracks the current policy's value $V^\pi$, PCL's value converges to $V^*$ because the residual couples policy and value through the optimality condition.
Multi-step with hard-max: No analog exists. The hard-max Bellman operator does not have an exact pointwise relationship like equation (42). Multi-step telescoping would accumulate errors from the max operator, making the constraint valid only in expectation under the optimal policy. The soft-max structure is what enables exact off-policy path consistency.
PCL vs SAC: Residual Minimization vs Successive Approximation¶
Both methods solve entropy-regularized MDPs but use fundamentally different solution strategies:
| Aspect | SAC | PCL |
|---|---|---|
| Solution method | Successive approximation: compute targets $y$, fit $Q_\theta$ to the targets | Residual minimization: minimize $\tfrac{1}{2}C^2$ directly |
| Update structure | Target computation + regression step | Single gradient step on squared residual |
| Target networks | Required (mark outer-iteration boundaries) | None (residual constraint, not target fitting) |
| Temporal horizon | Single-step TD: one-step backup $r + \gamma \hat{V}(s')$ | Multi-step paths: accumulate over $d$ steps |
| Off-policy handling | Replay buffer with single-sample bias | No importance sampling (works for any trajectory) |
| Dynamics requirement | General stochastic transitions | Deterministic transitions |
| Architecture | Twin Q-networks + policy network | Single Q-network (unified parameterization) |
PCL requires deterministic dynamics. It gains multi-step telescoping and off-policy learning without importance weights, but only for deterministic systems (robotic manipulation, many control tasks). SAC works for general stochastic MDPs.
PCL as Amortization¶
PCL amortizes at a different level than DDPG/TD3/SAC. Those methods amortize the action maximization: learn a policy network that outputs $\arg\max_a Q(s, a)$ directly. PCL amortizes the solution of the Bellman equation itself. Instead of repeatedly applying the Bellman operator (which requires a max, or soft-max, over actions at every iteration), PCL samples path segments and minimizes their residual. The computational cost of verifying optimality across all states and path lengths is distributed across training through sampled gradient updates.
Summary¶
This chapter addressed the computational barrier that arises when extending value-based fitted methods to continuous action spaces. The core issue is tractability: computing $\max_{a'} Q(s', a')$ at each Bellman target evaluation requires solving a nonlinear optimization problem. For replay buffers containing millions of transitions, repeatedly solving these optimization problems becomes prohibitive.
The solution we developed is amortization: invest computational effort during training to learn a policy network that replaces runtime optimization with fast forward passes. This strategy keeps us firmly within the dynamic programming framework. We still compute Bellman operators and maintain Q-functions. The amortization makes these operations tractable by replacing explicit maximization with learned policy networks.
Most methods (NFQCA, DDPG, TD3, SAC) follow the successive approximation paradigm: compute Bellman targets, fit to targets, repeat. PCL takes a different approach, directly minimizing a residual (the path consistency residual) rather than iterating the Bellman operator. This aligns PCL with least-squares residual methods from the projection methods chapter, rather than function iteration.
All methods share the core amortization idea but differ along several dimensions:
Solution methodology. NFQCA, DDPG, TD3, and SAC use successive approximation (function iteration): compute Bellman targets, fit to targets, repeat. This is the paradigm from the projection methods chapter. PCL uses residual minimization: directly minimize the squared path residual via gradient descent. This is the least-squares approach where we solve the Bellman equation by minimizing a squared residual rather than iterating the operator.
Policy class: Deterministic policies (NFQCA, DDPG, TD3) output a single action $a = \mu_\phi(s)$ and approximate $\arg\max_a Q(s, a)$ by maximizing $Q(s, \mu_\phi(s))$ over the policy parameters. This requires external exploration noise during training. Stochastic policies (SAC, PCL) output a distribution $\pi_\phi(\cdot \mid s)$ and approximate the Boltzmann distribution under entropy regularization. Exploration comes from sampling the stochastic policy.
Temporal structure: Single-step methods (NFQCA, DDPG, TD3, SAC) use one-step Bellman backups with targets $y = r + \gamma \hat{V}(s')$. SAC estimates the soft value $\hat{V}(s') = Q(s', a') - \tau \log \pi(a' \mid s')$, with $a' \sim \pi$, via single-sample Monte Carlo. PCL exploits deterministic dynamics to chain the Bellman equation over $d$ steps, using observed action sequences and accumulating rewards along entire path segments.
Target networks: Methods based on successive approximation (NFQCA, DDPG, TD3, SAC) use target networks that mark outer-iteration boundaries in the flattened FQI structure. PCL has no target networks because it minimizes a residual rather than fitting to computed targets. The twin Q-network trick (TD3, SAC) uses $\min(Q_{\theta_1}, Q_{\theta_2})$ to mitigate overestimation; PCL avoids this issue through the residual minimization structure.
All methods remain fundamentally value-based: they maintain Q-functions, compute approximate Bellman operators, and derive policies from learned value estimates. This connection to dynamic programming provides theoretical grounding (we know these methods implement approximate successive approximation of the Bellman equation) but also imposes structure. The policy must track the Q-function’s implied greedy policy (in DDPG/TD3) or Boltzmann distribution (in SAC). When the Q-function is inaccurate, which is inevitable with function approximation, the policy inherits these errors.
The next chapter takes a different perspective. Rather than extending DP-based value methods to continuous actions through amortization, we parameterize the policy directly and optimize it via gradient ascent on expected return. This shifts the fundamental question from "how do we compute $\max_a Q(s, a)$ efficiently?" to "how do we estimate the policy gradient accurately?" The resulting policy gradient methods are DP-agnostic: they work without Bellman equations, Q-functions, or value estimates. This removes the scaffolding of dynamic programming while introducing new challenges in gradient estimation and variance reduction.
- Hafner, R., & Riedmiller, M. (2011). Reinforcement learning in feedback control: Challenges and benchmarks from technical process control. Machine Learning, 84(1–2), 137–169. 10.1007/s10994-011-5235-x
- Judd, K. L. (1992). Projection methods for solving aggregate growth models. Journal of Economic Theory, 58(2), 410–452.
- Rust, J. (1996). Numerical dynamic programming in economics. In Handbook of Computational Economics (pp. 619–729). Elsevier. 10.1016/s1574-0021(96)01016-7
- Judd, K. L. (1998). Numerical Methods in Economics. MIT Press.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous Control with Deep Reinforcement Learning. arXiv Preprint arXiv:1509.02971.
- Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. International Conference on Machine Learning (ICML), 1587–1596.
- Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning (ICML), 1861–1870.
- Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic algorithms and applications. arXiv Preprint arXiv:1812.05905.
- Nachum, O., Norouzi, M., Xu, K., & Schuurmans, D. (2017). Bridging the Gap Between Value and Policy Based Reinforcement Learning. Advances in Neural Information Processing Systems, 30, 2775–2785.
- Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, 30–37.